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We generalize a sampling algorithm for lattice animals (connected clusters on a regular lattice) to a 
Monte Carlo algorithm for 'graph animals', i.e. connected subgraphs in arbitrary networks. As with 
the algorithm in [N. Kashtan et al., Bioinformatics 20, 1746 (2004)], it provides a weighted sample, 
but the computation of the weights is much faster (linear in the size of subgraphs, instead of super- 
exponential). This allows subgraphs with up to ten or more nodes to be sampled with very high 
statistics, from arbitrarily large networks. Using this together with a heuristic algorithm for rapidly 
classifying isomorphic graphs, we present results for two protein interaction networks obtained using 
the TAP high throughput method: one of Escherichia coli with 230 nodes and 695 links, and one for 
yeast {Saccharomyces cerevisiae) with roughly ten times more nodes and links. We find in both cases 
that most connected subgraphs are strong motifs (Z-scores > 10) or anti-motifs (Z-scores < —10) 
when the null model is the ensemble of networks with fixed degree sequence. Strong differences 
appear between the two networks, with dominant motifs in E. coU being (nearly) bipartite graphs 
and having many pairs of nodes which connect to the same neighbors, while dominant motifs in 
yeast tend towards completeness or contain large cliques. We also explore a number of methods 
that do not rely on measurements of Z-scores or comparisons with null models. For instance, we 
discuss the influence of specific complexes like the 26S proteasome in yeast, where a small number 
of complexes dominate the fc-cores with large k and have a decisive effect on the strongest motifs 
with 6 to 8 nodes. We also present Zipf plots of counts versus rank. They show broad distributions 
that are not power laws, in contrast to the case when disconnected subgraphs are included. 

PACS numbers: 02.70.Uu, 05.10.Ln, 87.10.+e, 89.75.Fb, 89. 75. He 



I. INTRODUCTION 

Recently, there has been an increased interest in com- 
plex networks, partly triggered by the observation that 
naturally occurring networks tend to have fat-tailed or 
even power law degree distributions P, Thus real- 
world networks tend to be very different from the com- 
pletely random Erdos-Renyi Q networks that have been 
much studied by mathematicians, and which give Pois- 
sonian degree distributions. In addition, most networks 
have further significant properties that arise either from 
functional constraints, from the way they have grown (fat 
tails, e.g., are naturally explained by preferential attach- 
ment), or for other reasons. As a consequence, a large 
number of statistical indicators have been proposed to 
distinguish between networks with different functional- 
ity (neural networks, protein transcription networks, so- 
cial networks, chip layouts, etc.) and between networks 
which were specially designed or which have grown spon- 
taneously (such as, e.g. the world wide web), under more 
or less strong evolutionary pressure. These observables 
include various centrality measures [3| , assortativity (the 
tendency of nodes with similar degree to link preferen- 
tially) Jj], clustering 0,0], different notions of modular- 
ity [M, 0, B B J properties of loop statistics [ll| , the 
small world property (i.e., slow increase of the effective 
diameter of the network with the number of nodes) p^ . 
bipartivity (the prevalence of even-sized closed walks over 
closed walks with an odd number of steps) [l3[ , and oth- 
ers. 



The frequency of specific subgraphs form a particular 
class of indicators. Subgraphs that occur more frequently 
than expected are referred to as motifs, while those oc- 
curring less frequently are anti- motifs [3, EM EB, [13] ■ 
Typically, motif search requires a null model for decid- 
ing when a subgraph is over- or under- abundant. The 
most popular null model so far has been the ensemble of 
all random graphs with the same degree sequence. This 
popularity is largely due to the fact that it can be simu- 
lated easily by means of the so-called 'rewiring algorithm' 
1^, JJi]. As we shall see, however, in the present analysis 
its value is severely limited, because it gives predictions 
that are too far from those actually observed. Other 
null models that retain more properties of the original 
network have been suggested [IJ; IIH , but have received 
much less attention. Anal ytic approaches to null models 
are discussed in Refs. [22I.I23I, [24|. 



A. Motifs and the Search for Structure 

Up to now, motif search has been mainly restricted to 
small motifs, typically with three or four nodes. Certain 
specific classes of larger subgraphs have been examined 
in Refs. [16,, ^ [s^- With the exception of Ref. [HI, 
few systematic attempts have been made to learn about 
significant structures at larger scale, by counting all pos- 
sible subgraphs (for a different approach to the discovery 
of structure than discussed here see the work on inference 
of hierarchy in Ref. ^23| ) . 
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One reason for this is that the number of non- 
isomorphic (i.e. structurally different) subgraphs in any 
but the most trivial networks increases extremely fast 
(super-exponentially) with their size. For instance, the 
number of different undirected graphs with 11 nodes is 
« 10^ 26]. Thus exhaustive studies of all possible sub- 
graphs with > 10 nodes becomes virtually impossible 
with present-day computers. But just because of this in- 
flationary growth, counts at intermediate sizes contain an 
enormous amount of potentially useful information. An- 
other obstacle is the notorious graph isomorphism prob- 
lem [23, [28] , which is in the NP class (though probably 
not NP complete |29i])- Existing state of the art programs 
for determining whether any two graphs are isomorphic 
[30I I remain too slow for our purpose. Instead, we shall 
use heuristics based on graph invariants similar to those 
put forward in Ref. [sij . where intermediate size motifs 
and anti-motifs in the protein interaction network of Es- 
cherichia coli were detected. 

The last problem when studying larger motifs, and the 
main one addressed in the present work, is the difficulty 
of estimating how often each possible subgraph appears 
in a large network, i.e. of obtaining a 'subgraph census'. 
Most studies so far were based on exact enumeration. 
In a network with N nodes, there are (^) subgraphs of 
size n. With N — 500 and n = 6, say, this number 
is « 5 X 10^^. In addition, most of the subgraphs gener- 
ated this way on a sparse network would be disconnected, 
while connected subgraphs are of more intrinsic interest. 
Thus some statistical sampling is needed. If one is willing 
to generate disconnected as well as connected subgraphs, 
then uniform sampling is simple: Just choose random n- 
tuples of nodes from the network (Blj . Uniform sampling 
connected subgraphs is less trivial. To our knowledge, 
the only work which addressed this systematically was 
Kashtan et at. [12] (for a less systematic approach, see 
also 'SS^). There, a biased sampling algorithm was put 
forward. While generating the subgraphs is fast, com- 
puting the weight factor needed to correct for the bias is 
exp[0(n)], making their algorithm inefficient for n > 7. 



B. Graph Animals 

In the present paper we exploit the fact that sampling 
connected subgraphs of a finite graph resembles sampling 
connected clusters of sites on a regular lattice. The latter 
is called the lattice animal problem [s^ , whence we pro- 
pose to call the subgraph counting problem that of graph 
animals. It is important to recognize obvious differences 
between the two cases. In particular, lattices are infinite 
and translationally invariant, while networks are finite 
and heterogeneous (disordered). For lattice animals one 
counts the number of configurations up to translations 
(i.e. per unit cell of the lattice), while on a network the 
quantity of immediate interest is the absolute number 
of occurrences of particular subgraphs. Still, apart from 
these issues, the basic operations involved in both cases 



coincide. 

Algorithms for enumerating lattice animals exactly ex- 
ist and have been pushed to high efficiency [s^, but 
are far from trivial [13] • Due to disorder, we should 
expect the situation to be even worse for graph an- 
imals. Algorithms for stochastic sampling of lattice 
animals are divided into two groups: Markov chain 
Monte Carlo (MCMC) algorithms take a connected clus- 
ter and randomly deform it while preserving connectiv- 
ity [13, [H, 113], while 'sequential' sampling algorithms 
grow the cluster from scratch [13, HH, 1431 |44{ . Even 
for regular lattices, MCMC algorithms seem less efficient 
than growth algorithms [41|. For networks, this differ- 
ence should be even more pronounced, since MCMC al- 
gorithms would dwell in certain parts of the network, and 
averaging over the different parts costs additional time. 
Thus we shall in the following concentrate only on growth 
algorithms. 

All growth algorithms similar to those in [40ll4ll . |43|. |4^ 
produce unbiased samples of percolation clusters. As ex- 
plained in Sec. II, this means that they sample clusters or 
subgraphs with non-uniform probability (for an alterna- 
tive algorithm, see jl^). Consequently, computing graph 
animal statistics requires the computation of weights to 
be assigned to the clusters, in order to correct for the 
bias. In contrast to the algorithm in Ref. [13], the cor- 
rect weights are easily and rapidly calculated in our graph 
animal algorithm. This is its main advantage. 



C. Summary 

In Sec. |TT] we present the graph animal algorithm in 
detail. The method used to handle graph isomorphism 
is briefly reviewed in Sec. III. Extensive tests, mostly 
with two protein interaction networks, one for E. coli 
with 230 nodes and 695 links I46|. and one for yeast 
with 2559 nodes and 7031 links j47], are presented in 
Sec. IV ^] . Both networks were obtained using the TAP 
high throughput method. In particular, our algorithm 
involves as a free parameter a percolation probability 
p. For optimal performance, in lattice animals p should 
be near the critical value where cluster growth perco- 
lates (4l|. We show how the performance for graph ani- 
mals depends on p, on the subgraph size n, and on other 
parameters. In Sec. V we use our sampling method to 
study these two networks systematically. We verify that 
large subgraphs with high link density are overwhelm- 
ingly strong motifs, while nearly all large subgraphs with 
low link density are anti-motifs [13, dH - although our 
data show much more structure than suggested by the 
scaling arguments of [l3]- We also find striking differ- 
ences in the strongest motifs for the two networks. Dom- 
inant motifs for the E. coli network are either bipartite or 
close to it (with many nodes sharing the same neighbors) 
while 'tadpoles' with bodies consisting of (almost) com- 
plete graphs dominate for yeast. Our conclusions and 
discussions of open problems are given in Sec. VI. 



3 



The present work only addresses undirected networks, 
but the graph animal algorithm works without major 
changes also for directed networks. Due to the larger 
number of different directed subgraphs, an exhaustive 
study of even moderately large subgraphs is much more 
challenging (i^ . 



II. THE ALGORITHM 

In this section we explain how our algorithm achieves 
uniform sampling of connected subgraphs in undirected 
networks. The graph animal algorithm executes a gener- 
alization of the Leath algorithm for lattice animals. The 
observation central to the work in Refs. [40, 41, 44] is that 
the animal and percolation ensembles concern exactly the 
same clusters. The only difference between the two en- 
sembles is that clusters in the percolation ensemble have 
different weights, while all clusters with the same num- 
ber of nodes (sites) have the same weight in the animal 
ensemble. We focus on site percolation |49]. Bond per- 
colation could also be used 41], but this would be more 
complicated and is not discussed here. 



A. Leath growth for graph animals 

For regular lattices and undirected networks we use the 
followin g ep idemic model for growing connected clusters 
of sites pO{ : 

(1) Choose a number p e [0, 1] and a maximal cluster 
size rimax- Label all sites (nodes) as 'unvisited'. 

(2) Pick a random site (node) Jq as a seed for the cluster, 
so that the cluster consists initially of only this site; mark 
it as 'visited'. 

(3) Do the following step recursively, until all boundary 
sites of the cluster have been visited, or until the cluster 
consists of riinax sites, whichever comes first: (Note that 
a boundary site of a cluster C is a site which is not in C, 
but which is connected to C by one or more edges). 

(A) Choose one of the unvisited boundary sites of the 
present cluster, and mark it as visited; (B) With proba- 
bility p join it to the cluster. 

Once a boundary site has been visited, it cannot later 
join the cluster; it either joins the cluster when it is first 
visited (with probability p) or is permanently forbidden 
to join (with probability 1 — p). 

The order in which the boundary (or 'growth') sites 
are chosen influences the efficiency of the algorithm, but 
this is irrelevant for the present discussion. The growth 
algorithm can be seen as an idealization of an epidemic 
process ('generalized' or SIR epidemic (Ell, [13]) with three 
types of individuals (Susceptible, Infected, Removed). 
Starting with a single infected individual with all others 
susceptible, the infected individual can infect neighbours 
during a finite time span. Everyone either gets infected or 
doesn't at his/her first contact. The latter are removed. 



as are the infected ones after their recovery, and do not 
participate in the further spread of the epidemic. 

Assume that for some fixed node io, a connected la- 
beled subgraph exists, which contains io and has 
n < 'T-max nodes and b visited boundary nodes. The 
chance that precisely this particular labeled subgraph 
will be chosen using the algorithm is 

Pg4p;^o) = Pnb{p;i,)=p''-\i-pf ■ (i) 

Since an independent decision is made at each boundary 
site, this is indeed the probability for n — 1 sites to be 
selected to join the cluster, while b sites are rejected. 

Denote by c{G^) the indicator function for the exis- 
tence of G^, i.e. c(G^) = 1 if the subgraph exists in the 
network, and c(G^) — else. Furthermore, denote by 
c(G^;io) the explicit indicator that G^ exists and con- 
tains the node iq. Then the total number of occurrences 
of the unlabeled subgraph G is given by 

N N 

CG = ^ CG,^ = n-'J2 E (2) 

where CG,i is the number of occurrences which contain 
node i, and where the last sum runs over all labeled sub- 
graphs G^ that are isomorphic to G. The factor takes 
into account that a subgraph with n nodes is counted n 
times. 

If we repeat the epidemic process M times, always 
starting at the same node io, then the expected number 
of times G^ occurs is 

(m(G^p,io)) = Mc(G^;zo)Pg*(p;»o) ■ (3) 

Hence, an estimator for CG,i based on the actual counts 
m{G^;P,io) after AI trials is 

CG,^oiM) = Ar' m{G';p,to)[PG4p;^oT' . (4) 

Here and in what follows carets always indicate estima- 
tors. 

More generally, the starting nodes are chosen according 
to some probability Qi^. After AI >> 1 trials in total, 
site io will have been used as starting point on average 
QigAd times. This gives then the estimator for the total 
number of occurrences of G 

N 

cg{M) = n-^Y.'^GMoM) (5) 

io = l 

N 

= (riAf)-i^Qr' E rr^{G'■.P,^)[PG^{p■,^)r'■ 

i=l G<^~G 

It is simplest to take a uniform probability Qi„ = 1/N. 
But a better alternative is to choose each node with 
a probability proportional to its degree, as nodes with 
larger degrees have more connected subgraphs attached 
to them. This is accomplished by choosing a link with 
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uniform probability where L is the total number of 
links in the network, and then choosing one of the two 
ends of this link at random. This gives 

= (2L)-ifc,. (6) 

The algorithms of ^4^,13] are directly based on Eq. (5). 
Their main drawback is that all information from clusters 
which are still growing at size n is not used. Clusters 
whose growth had stopped at sizes < n don't contribute 
to cg either, of course. Thus only those that stop growing 
exactly at size n are used in Eq. (5). This requires, among 
other things, a careful choice of p: If p is too large, too 
many clusters survive past size n, while in the opposite 
case too few reach this size at all. But even with the 
optimal choice of p, most of the information is wasted. 



B. Improved Leath method 

The major improvement comes from the following ob- 
servation [4l|: Assume that a cluster has grown to size n, 
and among the b boundary sites there are exactly g which 
have not yet been tested ('growth sites'). Thus growth 
has definitely stopped at b — g already visited boundary 
sites, while the growth on the remaining g boundary sites 
depends on future values of the random variable used 
to decide whether they are going to be infected. With 
probability (1 — py none of them are susceptible, and 
the growth will stop at the present cluster size n. Thus 
we can replace the counts m{G^;p,io) in the estimator 
for CG by the counts of 'unfinished' subgraphs, provided 
we weigh each occurrence of a subgraph isomorphic to G 
with an additional weight factor (1 — pY . Formally, this 
gives, with uniform initial Hnk selection (Eq. ([6])), 

X "/Ti unfinished {G';p,i,g). (7) 

The quantity TOunfinishGd(G^;p, <?) is the number of epi- 
demics (with parameter p) that start at node i, give a 
labeled subgraph of infected nodes, and leave g un- 
visited boundary nodes. The factor p^~"'(l — pY'^ has 
a simple interpretation. In analogy to Eq. ([T]) it is the 
probability to grow a cluster with n — 1 nodes in addi- 
tion to the start node, g growth nodes, and g blocked 
boundary nodes, 

Pnba{p:io)=p''-\l^pf-'-' ■ (8) 

Eq. ([7]) is the number of generated clusters, reweighted 
with their inverse probabilities to be sampled, given they 
exist. It is the formula we use to estimate frequencies of 
occurrences of connected subgraphs in the protein inter- 
action networks as discussed later in the text. 



C. Resampling 

In principle, Eq. ([7]) can be improved further. Ref. (4l| 
shows how to use the equivalents of Eas. (|7l8p for lattice 
animals as a starting point for a re-sampling scheme. For 
completeness, re-sampling for graph animals is briefly ex- 
plained, even though it is not used in this work. 

For each cluster that is still growing a fitness function 
is defined as 

Mp)^p'-'\i~p)-' = [Pnbg{pMVl{l-pY- (9) 

Clusters with too small fitness are killed, while clusters 
with too large fitness are cloned, with both the fitness and 
the weight being split evenly among the clones. The first 
factor in the fitness is just proportional to the weight, 
while the second factor takes into account that clusters 
with larger g have more possibilities to continue their 
growth, and thus should be more 'valuable'. The precise 
form of Eq.® is purely heuristic, but was found to be 
near optimal in fairly extensive tests. 

This resampling scheme was found to be essential, if 
one wants to sample clusters of sizes n > 100. In [4l| . 
the emphasis was on very large clusters (several thousand 
sites), and thus resampling was a necessity. Here, in con- 
trast, we concentrate on subgraphs with « 10 nodes or 
less, and stick to the simpler scheme without resampling. 
With respect to graph animals, we point out that op- 
timal fitness thresholds for pruning and cloning depend 
in a irregular network on the start node, iq, and have 
to be learned for each io separately. Although a similar 
strategy achieves success for dealing with self avoiding 
walks on random lattices jsoj , this is much more time 
consuming than for regular lattices. 



D. Implementation details 

For fast data access, we used several redundant data 
structures. The adjacency matrix was stored directly as 
a. N X N matrix with elements 0/1 and as a list of linked 
pairs i.e. as an array of size L x 2. The first is 

needed for fast checking of which links are present in a 
subgraph, while the second is the format in which the 
networks were downloaded from the web. Finally, for 
fast neighbor searches, the links were also stored in the 
form of linked lists. To test whether a site was visited 
during the growth of the present (say fc-th, k = 1 . . . M) 
subgraph, an array s[i] of size N and type unsigned int was 
used, which was initiated as s[i]=0, i — 0,...N-1. Each 
time a site i was visited, we set s[i] = k, and s[i] < k 
was used as indicator that this site had not been visited 
during the growth of the present cluster. 

In Leath-type cluster growth, there are two popular 
variants. Untested sites in the boundary can be written 
either into a first-in first-out queue, or into a stack (first- 
in last-out queue). In was found in [4l| that these two 
possibilities, whose efficiency is roughly the same when 
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Eq. (5) is used, give vastly different efffciency with Eq.(l7]), 
in particular (but not only) in combination with resam- 
pling. In that case, the first-in first-out queue gives much 
better results, and we use this method to get the numer- 
ical results shown later. 



III. SUBGRAPH CLASSIFICATION 

After sampling a labelled subgraph , one has to find 
its isomorphism class G (i.e., Gi ^ G), by testing which 
of the representatives for isomorphism classes it can be 
mapped onto by permuting the node labels. State-of-the- 
art computer programs for comparing two graphs, such 
as NAUTY [30], proceed in two steps. First, some invari- 
ants are calculated such as the number of links, traces of 
various powers of the adjacency matrix, a sorted list of 
node degrees, etc. In most cases, this shows that the 
two graphs are not isomorphic (if any of these invariants 
disagree), but obviously this does not resolve all cases. 
When ambiguities remain, each graph is transformed 
into a standard form by a suitable permutation, and 
the standard forms are compared. The standard form 
is, of course, also a special invariant, so the distinction 
between "invariants" and "standard form" might seem 
arbitrary. It becomes relevant in practice, since the user 
of the package can specify which invariants (s)he deems 
relevant, while the calculation of the standard form is at 
the core of the algorithm and cannot be changed. 

It is mostly the second step in this scheme which is time 
limiting and which renders it useless for our purposes - 
although some invariants suggested e.g. by NAUTY are 
also quite demanding in CPU time. Thus we skip the 
second step and only use invariants that are fast to com- 
pute. All these invariants, except for the number n of 
nodes and the number £ of links in the subgraph, are 
combined into a single index /, which is intended to be a 
good discriminator between all non-isomorphic subraphs 
with the same n and £. Whenever a new subgraph is 
found, the triplet (n, £, I) is calculated and compared to 
triplets that have already appeared. If the triplet ap- 
peared previously, the counter for this triplet is increased 
by 1; if not, a new counter is initiated and set to 1. 

Since no known invariant (other than standard form) 
can discriminate between any two graphs, any method 
not using it is necessarily heuristic. Some of the invari- 
ants we used are those defined in Ref. 



3l| . In addition. 



we use invariants based on powers of the adjacency ma- 
trix and of its compliment. More precisely, if Aij is the 
adjacency matrix of a subgraph, then we define its com- 
plement by Bij = 1 — Aij for i ^ j and Bij = = Aij 
for i = j. Any trace of any product A°-^ B^^ A°-^ ... is in- 
variant, and can be computed quickly. The same is true 
for the number of non-zero elements of any such product, 
and for the sum of all its matrix elements. The index / 
is then either a linear combination or a product (taken 
modulo 2^^) of these invariants. The particular choices 
were ad hoc and there is no reason to believe they are 



optimal; hence those details are not given here. 

With the indices described in [3l|, all undirected 
graphs of sizes n < 8 and all directed graphs with up 
to 5 nodes are correctly classified. In this work, a faster 
algorithm for counting loops is used; hence loop counting 
is always included, in contrast to the work of [31|. Index 
calculation based on matrix products is even faster but 
less precise: only 11112 out of all 11117 non-isomorphic 
connected graphs with ri = 8 were distinguished, and for 
directed graphs with n — b just 4 graphs out of 9608 
were missed. For larger subgraphs we were not able to 
test the quality of the indices systematically, but we can 
cite some results for n = 9. Using indices based on matrix 
products, we found 239846 different connected subgra phs 
with n = 9 in the E. coli protein interaction network [46[ 
and its rewirings. Given the fact that there are only 
261080 different connected graphs with n = 9 53], that 
many of them might not appear in the E. coli network, 
and that our sampling was not exhaustive, our graph 
classification method failed to distinguish at most 9% of 
the non-isomorphic graphs - and probably many fewer. 



IV. NUMERICAL TESTS OF THE SAMPLING 
ALGORITHM 

To test the graph animal algorithm, we first sampled 
both n = 4 and n — 5 subgraphs of the E. coli network, 
as well as n = 4 subgraphs of the yeast network. In 
these cases exact counts are possible, and we verified that 
the results from sampling agreed with results from exact 
enumeration within the estimated (very small) errors. To 
obtain these results wc used crude estimates for optimal 
p values, namely p = 0.11 for E. coli and p = 0.03 for 
yeast. For larger subgraphs more precise estimates for 
the optimal p are required. 



A. Optimal values for p 

When p is too small, only small clusters are regularly 
encountered. If p is too large, performance decreases be- 
cause the weight factors in Eq. ([7]) depend too strongly 
on the number of blocked boundary sites, b — g. The 
latter varies from instance to instance, and this can cre- 
ate huge fluctuations in the weights given to individual 
subgraphs. 

The networks we are interested in are sparse {L/N w 
const <^ N) and approximately scale- free [13]. As a 
result, most nodes have only a few links, but some 
'hubs' have very high degree. In fact, the degrees of the 
strongest hubs may diverge in the limit N ~* oo. For such 
networks it is well-known that the threshold for spread- 
ing of an infinite SIR epidemic is zero [sij . On finite net- 
works this means that one can create huge clusters even 
for minute p, and this tendency increases as N increases. 
Thus, we anticipate the optimal p to be small, and to 
decrease noticeably in going from the E. coli {N = 230) 
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FIG. 1: (color online) Root mean square relative errors of 
connected subgraph counts, Eq. (|10|l . for the yeast (n = 5 to 
8) and E. coli (n — 7) networks. In most cases, clear minima 
indicate roughly the optimum value for p, with caveats as 
explained in the text. Each data point is based on 4 x 10^ 
generated subgraphs. Smaller values of cr„(p) indicate that 
the census for subgraphs with n nodes is on average more 
precise. 



to the yeast network {N = 2559). This is, in fact, what 
we find. 

As a first test, we compute the root mean square rela- 
tive errors of the subgraph counts, averaged over all sub- 
graphs of fixed size n. Let 7„ be the number of different 
subgraphs of size n found, and let Acq be the error of 
the count for subgraph G. These errors were estimated 
by dividing the set of M independent samples into bins, 
and estimating the fluctuations from bin to bin. Then 
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(10) 



Smaller values of cr„(p) indicate that the subgraph cen- 
sus is on average more precise. Fig. [1] shows results for 
the yeast network, with various values of p and n. Also 
shown are data for the E. coli network, for n — 7. Each 
simulation used for this figure (i.e., each data point) in- 
volved M = Ax 10^ generated clusters. Our first observa- 
tion is that the results for E. coli are much more precise 
than those for yeast. This is mainly due to smaller hubs 
{k^lf = 36, while fc^^af = 141), so that much larger p 
values [5^ could be used. Also in all other aspects, our 
algorithm worked much better for the E. coli network 
than for yeast. Therefore we exhibit in the rest of this 
section only results for yeast, implying that whenever a 
test was positive for yeast, an analogous test had been 
made for E. coli with at least as good results. 

Even with the large sample sizes used in Fig. [1] many 
n = 8 subgraphs were found only once (in which case 



FIG. 2: (color online) Histograms of wP(ln«;) — w'^P{w) for 
connected n = 8 subgraphs of the yeast network. Each curve 
corresponds to one run (4 x 10^ generated subgraphs) with 
fixed value of p. Results are the more reliable, the further 
to the left is the maximum of the curve and the faster is the 
decrease of its tail at large w. 



we set Acg /cq — 1), which explains the high values 
of CTsip). This is also why we do not show any data 
for n > 8 in Fig. [1] The relative error (T„(p) for each 
n < 8 shows a broad minimum as a function of p. The 
increase in (7„ (p) at small p is because of the paucity of 
different graphs being generated. This effect grows when 
n increases, explaining why the minimum shifts to the 
right with increasing n. The increase of (Jn{p) for large 
p, in contrast, comes from large fluctuations of weights 
for individual sampled graphs. Whenp is large, the factor 
{l-pf-s in Eq.dll) can also be large, particularly in the 
presence of strong hubs. 

Unfortunately, if a subgraph is found only once, it is 
impossible to decide whether or not the frequency es- 
timate is reliable. Even for strong outliers, when the 
frequency estimate is far too large, the formal error es- 
timate cannot be larger than Acg = 0{cg)- This un- 
derestimates the true statistical errors and is partially 
responsible for the fact that the curve for n = 8 in Fig. [1] 
does not increase at large p [56| . 

A more direct understanding of the decreasing perfor- 
mance at large p comes from histograms of the (loga- 
rithms of) weight factors. Such histograms, for n = 8 
subgraphs in the yeast network, are shown in Fig. [21 
From the results in Section HIl 
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is the weight for a subgraph with n nodes, h boundary 
nodes, and g growth nodes. The algorithm produces re- 
liable estimates if P(w) decreases for large w faster than 
1/1/7^, since averages (which are weighted by w) are then 
dominated by subgraphs that are well sampled. If, in 
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FIG. 3: (color online) Scatter plots of cg{p ~ 0.025) and 
cg{p = 0.07) against cg{p = 0.04) for connected n — S sub- 
graphs of the yeast network. The clustering of the data along 
the diagonal indicates the basic reliability of the estimates, 
independent of the precise choice of p. Sample sizes were 
4 X 10^° for p = 0.04, 2.4 x 10^° for p = 0.025, and 8 x 10^ 
for p = 0.07. The latter two correspond to roughly the same 
CPU time. 



contrast, P{w) decreases more slowly, then the tail of the 
distribution dominates, and the results cannot be taken 
at face value [13 • We observe from Fig. [2] that the data 
for n = 8 is indeed reliable for p < 0.07 only. The curve 
for p ~ 0.09 in Fig. [2] also bends over at very large val- 
ues of w, indicating that even for this p our estimates 
should finally be reliable, when the sample sizes become 
sufficiently large. But this would require extremely large 
sample sizes. 

As a last test we checked whether the estimates cq 
are independent of p as they should be. Fig. [3] shows 
the estimates obtained for n = 8 subgraphs in the yeast 
network with p — 0.025 and p — 0.07 against those ob- 
tained with p = 0.04. Clearly, the data cluster along the 
diagonal - showing that the estimates are basically cor- 
rect. They scatter more when the counts are lower (i.e. 
in the lower left corner of the plot). The asymmetries 
in that region result from the fact that rarely occurring 
subgraphs are completely missed for p — 0.04 and even 
more so for p = 0.025, cutting off thereby the distribu- 
tions at small cq. For larger counts, the estimates for 
p — 0.025 are more precise than those for p ~ 0.07. The 
latter show high weight "glitches" arising from the tail of 
P{w) discussed earlier in this section. 

For increasing p, the numbers mc of generated sub- 
graphs of type G increase of course (as the epidemic 
survives longer), so that average weights, defined as 
(wg) = ccM/mG, decrease. But this decrease is not uni- 
form for all G. Rather, it is strongest for fully connected 



subgraphs {£ = n{n— l)/2), and is weakest for trees. For 
the yeast network and n = 8, e.g., {wq) averaged over all 
trees decreases by a factor ~ 18 when p increases from 
0.025 to 0.085, while {wg) averaged over all graphs with 
£ > 25 decreases by a factor ~ 1700. Smaller values of 
(wg) are preferable, as they imply smaller fluctuations. 
Thus it would be most efficient to use larger p values for 
highly connected subgraphs, and smaller p for tree-like 
subgraphs. Counting very highly connected subgraphs - 
where every node has a degree in the subgraph > /eq, say 
- is also made easier by first reducing the network to its 
fc-core with k — ko, and then sampling from the latter. 



V. RESULTS 
A. Characterization of the networks 

As already stated, both networks as we use them are 
fully connected [i^. The E. coli network has 230 nodes 
and 695 links, while the yeast network has 2559 nodes 
and 7031 links. Both networks show strong clustering, 
as measured by the clustering coefficients Q 



ki{ki 1) 



j<m 



(12) 



where ki is the degree of node i and the sum runs over 
all pairs of nodes linked directly to i. In Fig. 2] we show 
averages of Gi over all nodes with fixed degree k. We see 
that {G)k is quite large, but has a noticeably different 
dependence on k for the two networks. While it decreases 
with k for E. coli, it attains a maximum at fc « 15 for 
yeast. 

The unweighted average clustering C — N^^ J2iLi 
is 0.1947 for yeast, and 0.2235 for E. coli. Due to the 
different behavior of {G)k, the ranking is reversed for the 
weighted averages 
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where ua is the number of fully connected triangles on 
the network and ny is the number of triads with two 
links (see @ for a somewhat different formula). Numer- 
ically, this gives (C) = 0.1948 for yeast and 0.1552 for 
E. coli. This can be understood as a consequence of the 
fact that the relative frequency of fully connected trian- 
gles is higher in yeast than in E. coli: in yeast (E. coli) 
there are 6969 (478) triangles compared to 86291 (7805) 
triads with two links. 

Associated with this difference are distinctions between 
the fc-cores ,58] of the two networks. Fig. [5] shows the 
sizes of the fc-cores against k. We see that the yeast net- 
work contains non-empty cores with k up to 15. More- 
over, the core with A: = 15 has exactly 17 nodes. It is 
a nearly fully connected subgraph with just one missing 
link. All 17 proteins in this core are parts of the 26S 
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FIG. 4: (color online) Average clustering coefficients for nodes 
with fixed degree k plotted versus the degree, for the giant 
component of the yeast and E. coli protein interaction net- 
works. While the clustering coefficient decreases with k for 
E. coll, it attains a maximum at fc ~ 15 for yeast. 
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FIG. 6: (color online) Counts for connected subgraphs with 
fixed topology and with n < 8 in the E. coli network, plotted 
against r? + 21. The variable r? + 2i is used to spread out 
the data, so that the dependence on both n and i (number of 
links) can be seen independently, without data points over- 
lapping. For most of the points, the error bars are smaller 
than the sizes of the symbols. 
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FIG. 5: (color online) Sizes of the fc— cores for the two net- 
works, plotted against k. Notice that the fc— cores for yeast 
contain a nearly fully connected cluster with 17 nodes. In 
addition to the core sizes for the original networks, the figure 
also shows average core sizes for rewired networks as discussed 
in section V C. 



proteasome which consists of 20 or 21 proteins (sol, [60|. 
All these proteins presumably interact very strongly with 
each other. When the interactions between the proteins 
within the 26S proteasome are taken out (the correspond- 
ing elements of the adjacency matrix are set to zero), 
the fc-core with highest fc has fc = 12 and consists of 15 
nodes. All its nodes correspond to proteins in the medi- 
ator complex of RNA polymerase II [1^ , which contains 
20 proteins altogether. After eliminating all interactions 



between these, two 11-cores with respectively 13 and 14 
nodes remain, the first corresponding to the 20S protea- 
some and the second corresponding to the RSC complex 
[59| . Again these particular complexes have only a few 
more proteins than those contained within their largest 
fc-cores, so they are very tightly bound together. All 
remaining complexes appear to be more loosely bound, 
so that much of the strong larger scale clustering in the 
yeast network (involving 7-10 nodes) can be traced to 
only a few tightly bound complexes. This has a big effect 
on the subgraph counts, as we shall see. 



B. Trends in Subgraph counts 

Subgraph counts cq for the Escherichia coli and yeast 
networks, plotted against r? -j- 2£, are shown in Figs. [H 
and[71 For large n we see a very wide range, with counts 
varying between 1 and > 10*. In general, counts decrease 
with increasing number of links, i.e. trees are most fre- 
quent. This is a direct consequence of the fact that the 
networks are sparse. Even when n and I are fixed, the 
counts CG can range over six orders of magnitude (e.g. 
for yeast with n = 8 and i ^ 17). 

For the yeast network, there are clear systematic trends 
for the counts at fixed n and i. The most frequent sub- 
graphs are those with strong heterogeneity, i.e. with a 
large variation of the degrees (within the subgraph) of 
nodes, while the most rare are those with minimal varia- 
tion. Fig. [8] shows the counts cq for n = % and with four 
different values of £ plotted against the variance of the 
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FIG. 7: (color online) Counts for subgraphs with fixed topol- 
ogy and with n < 8 in the yeast network, plotted against 
+ 2£ as in Fig. [1 



degrees of the nodes within the subgraph, 

-^ = ^Efc?-[^E^»^ (14) 
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FIG. 8: (color online) Counts for n — 8 subgraphs of the 
yeast network with £ — 7, 10, 13, and 18, plotted against the 
variance of the node degrees within the subgraphs, as given 
by Eq. 1141 Zero variance means that all nodes have exactly 
the same degree, whereas a higher variance indicates that the 
nodes differ more widely. Typically, subgraphs with more 
variation in their nodes (and thus with larger a^) have higher 
counts than those for which the degrees within the subgraph 
are more uniform. 



For all four curves we see a trend, where the count in- 
creases with cr, but hardly any trend like this is seen for 
the E. coli network (data not shown). The effect seen in 
the yeast data is probably related to the very strongly 
connected core in that network (see the last subsection). 
As we shall also see later in subsection D, subgraphs 
with high counts in yeast often have a tadpole form with 
a highly connected body (which is part of one of the 
densely connected complexes discussed in the last subsec- 
tion) and a short tail attached to it. These cores may also 
be responsible for the main difference between Figs.[S]and 
[21 namely the strong representation of very highly con- 
nected (large i) subgraphs in the yeast network. Taking 
out all interactions within the 26S and 20S proteasomes, 
within the mediator complex and within the RSC com- 
plex reduces substantially the counts for highly connected 
subgraphs. The count for the complete n = 7 subgraph, 
e.g., is reduced in this way from 25, 164 ± 68 to 682 ± 23. 
The removal of interactions within the 26S proteasome 
makes by far the biggest contribution. 



C. Zipf plots 

In [3l[ it was found that "Zipf plots" (subgraph counts 
vs. rank) in the E. coli network exhibit power law be- 
havior, whose origin is not yet understood. The essen- 
tial difference between the subgraph counts in [3l| and 
in the present paper is that we sample only connected 
subgraphs, while all subgraphs with given n were ranked 



in [31]. Also, noting that disconnected subgraphs are 
more likely to be sampled than connected ones when 
picking nodes at random (due to the sparsity of the net- 
works) , we can go to much higher ranks for the connected 
subgraphs. 



Zipf plots for connected subgraphs in the E. coli net- 
work are shown in Fig. [9l Each curve is based on 4x10^ 
to 10"'^'' generated subgraphs. Each is strongly curved, 
suggesting that there are no power laws - at least for 
subgraph sizes where we obtain reasonable statistics for 
the census. The curves show less curvature for larger n, 
but this is a gradual effect. It seems that the scaling 
behavior found in [3l| was mainly due to the presence 
of disconnected graphs, although it is not immediately 
obvious why those should give scale-free statistics either. 
In addition, the right hand tails of the Zipf plots in [3lj | 
were cut off because of substantially lower statistics. In 
our case, apparently sharp cutoffs in the counts are ob- 
served for ranks w 1.08 x 10* for n = 8, w 2.1 x 10^ 
for n = 9, and « 2.9 x 10^ for n = 10. For n < 9 
these are close to the total number of different connected 
subgraphs [11], suggesting that we have fairly complete 
statistics. For n = 10 the cutoff is more affected by lack 
of statistics, but it is still within a factor of four of the 
upper limit. 
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FIG. 9: (color online) "Zipf plots showing the counts for 
individual connected subgraphs with fixed n, plotted against 
their rank. Data are for the E. coli network. 
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FIG. 11: (color online) Same as Fig. 1101 but for the yeast 
network. Notice that most data points for large n and £ are 
missing. Indeed, for n = 7 all (!) data points with £ > 16 are 
missing, because no such subgraphs were found in the rewired 
ensemble. 
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FIG. 10: (color online) Ratios between the count estimates 
CO for connected subgraphs in the E. coli network, and the 
corresponding average counts (cq') in rewired networks. The 
data are plotted against n^+2£, again to spread the points out 
conveniently. Most error bars are smaller than the symbols. 



to be the ensemble of networks vifith the same degree 
sequence, obtained by the rewiring method. The aver- 
age subgraph counts in the null ensemble are denoted as 
(cq^)- In Figs. [TOl and fTTl we plot the ratios cg/{c-q) 
against the variable n? + 2£ for each connected subgraph 
that was sampled both in the original graph and in at 
least one of the rewired graphs. The error bars, which 
include both statistical errors from sampling and the en- 
semble fluctuations of the null model estimated from 
several hundred rewired networks, are for most points 
smaller than the symbols. A subgraph is a motif (anti- 
motif), if this ratio is significantly larger (smaller) than 
1. Notice that motifs do not in general occur partic- 
ularly frequently in the original network. Even with- 
out rigorous estimates to estimate significance, it is clear 
that most densely connected subgraphs are motifs in the 
yeast network. The fact that trees or subgraphs with few 
loops tend to be anti-motifs might not be so evident from 
Fig. [Til since the ratios for trees and tree-like graphs are 
close to one. Thus we have to discuss significance more 
formally. 



D. Null model comparison and motifs 



Z-scores 



One of the most striking results of [3l|] was that most 
large subgraphs were either strong motifs or strong anti- 
motifs. However, this finding was based on rather lim- 
ited statistics and on a single protein interaction network. 
One of the purposes of the present study is to test this 
and other results of [3l[ with much higher statistics and 
for a larger network, the protein interaction network of 
yeast. 

To define a motif requires a null model. We take this 



Usually [3jJ, the significance of a motif (or anti-motif) 
is measured by its Z-score 
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is the standard deviation of cq within the null 
A subgraph is a motif (anti-motif), if Z ^ 1 
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FIG. 12: The eight strongest motifs with n — 7 in the E. coli 
protein interaction network. These tend to be almost bipar- 
tite graphs, and many pairs of nodes are Unked to the same 
set of neighbors. Their Z-scores, in order from left to right, 
first then second row, are: 2.9 x 10*, 932, 885, 648, 595, 532, 516 
and 377. Their estimated frequencies in the original E. coli 
network are, in the same order: 20936 ±8, 161521 ±63, 8312 ± 
5, 1331 ± 2, 838 ± 2, 5985 ± 5, 5165 ± 4, and 519 ± 1. 



The eight strongest motifs with n — 7 in the E. coli net- 
work according to this definition are shown in Fig. [T^l to- 
gether with their Z-values. To name the strongest motifs 
in the yeast network is less straight forward, since many 
subgraphs did not show up in any rewired network at all. 
Assuming for those subgraphs (Tq^ ~ {<^cP) — would 
give Z — oo. Rough lower bounds on Z are obtained for 

them by assuming that (cq^) < l/R and CTq"* < 
where R is the number of rewired networks that were 
sampled, giving Z > CG\fR- Some of the strongest mo- 
tifs in the yeast network, together with their estimated 
Z-scores, are shown in Fig. [131 Note that no n = 7 graphs 
with £ > \Q were found in any of the realizations of the 
null model, while they were all found in the real yeast 
network. Hence these are all strong motifs. Those mo- 
tifs in Fig.[T3]for which only lower bounds for the ^-score 
are given are the most frequent in the real network, hence 
they have the highest lower bound. It was pointed out 
in [3^ [sj that cliques (complete subgraphs) are in gen- 
eral very strong motifs. In yeast, the n = 7 clique (with 
i = 21) is indeed a very strong motif, but it does not have 
the largest lower bound on the Z-score. In comparison, 
anti-motifs have rather modest Z-scores. The strongest 
anti-motif with n = 7 has Z = -32.9 {Z = -24.7) for 
E. coli (yeast). 

With Z-values up to 10'' and more, as in Fig. [T21 the 
motivation for using Z-scores becomes suspect. On the 
one hand, the null model is clearly unable to describe the 
actual network, and has to be replaced by a more refined 
null model. This will be done in a future paper [i^. On 
the other hand, it suggests to use instead a Z-score based 
on logarithms of counts, 

logCG- (logcg^) 
'^log (5) > l-Loj 

'^log.G 

where af^g q is the standard deviation of log Cq^ . An 
advantage of Eq. (fTB)l would be that it suppresses \Z\ for 




FIG. 13: Eight very strong motifs with n = 7 for the yeast 
protein interaction network. These tend to be almost com- 
plete graphs with a single dangling node. Four of these 
graphs were not seen in any realization of the null model, 
so only lower bounds on their ^-scores can be given. From 
left to right, first then second row, the estimated Z-scores are: 

> 3xlo^9xlo^> 8xlo^5xlo^> 4xlo^3xlo^2.5xlo^ 

and > 1.5 x lO". Estimated frequencies are, in the same 
order: 6.68(1) x 10^ 9.27(5) x lO", 1.76(1) x 10^ 4.84(1) x 
10^ 7.78(2) X 10'', 3.13(6) xlO^ 1.38(1) xl0^ and 3.35(1) x 10*. 



motifs, but enhances \Z\ for anti-motifs. 

In general, strong yeast motifs have a tadpole structure 
with a complete or almost complete body, and a tail con- 
sisting of a few nodes with low degree. This agrees nicely 
with our previous observation that frequently occurring 
subgraphs in the yeast network have strong heterogeneity 
in the degrees of their nodes. In contrast, strong E. coli 
motifs with not too many loops are all based on a 4-3 or 
5-2 bipartite structure. When the number of loops in- 
creases, strictly bipartite structures are impossible, but 
the tendency towards these structures is still observed. 

Whether we use Z-scores or the ratio Cg/Cq^ to iden- 
tify motifs makes very little difference. Using either crite- 
rion, the strengths of the strongest motifs skyrocket with 
subgraph size. This is most dramatically apparent for the 
yeast network. Indeed, correlations between Z-scores of 
individual graphs in the yeast and E. coli networks (data 
not shown) are much weaker than correlations between 
count ratios. The latter are shown in Fig. [T3] for n = 7 
subgraphs. 



2. Twinning versus Clustering 

Another characteristic feature of strong motifs in the 
E. coli network is the tendency for 'twin' nodes. We call 
two nodes in a subgraph twins if they are connected to 
the same set of neighbours in the subgraph. Otherwise 
said, nodes i and k are twins, iff the z-th and fc-th rows of 
the subgraph adjacency matrix are identical. Notice that 
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FIG. 14: (color online) Count ratios cg/{c^q) for individual 
subgraphs in the E. coli network, plotted against the count 
ratio for the same subgraph in the yeast network. To highlight 
the dependence on the number of twin nodes in the subgraph, 
subgraphs with ntwin > 1 (jitwin > 3) are marked by asterisks 
(bullets). Whereas almost all ratios are much higher in the 
yeast network, this is noticeably less true for subgraphs con- 
taining more than three pairs of twin nodes. These tend to 
fall on the diagonal indicated by the dashed line. 



twin nodes can be created most naturally by duplicating 
genes. We found that subgraphs with many pairs of twin 
nodes are in general also motifs in the yeast network, 
but they do not stand out spectacularly from the mass of 
other motifs. They could be the 'genuine' motifs also for 
yeast, but only a better null model where all subgraphs 
actually occur with reasonable frequency would be able 
to prove or disprove this. 

In Fig. [T3] we also indicated the dependence on the 
number ntwin of pairs of twin nodes, by marking sub- 
graphs with ntwin > 3 (ntwin > 1) by bullets (asterisks). 
We see that all strong motifs in E. coli have multiple 
pairs of twin nodes. These subgraphs tend to be also 
motifs of comparable strength in yeast - the bullets in 
Fig-HHtend to cluster on the diagonal [cg/ {c'g)]e. coU — 

[cgI {cQ')]yeast- Howcver, there are even stronger motifs 
in yeast that have no twin nodes. These graphs are typ- 
ically much weaker motifs or not motifs at all in E. coli. 

As we have already indicated, many of the strong mo- 
tifs in yeast seem to be related to a few densely connected 
complexes such as those discussed in subsection A. They 
are either part of their cores, or they have most of their 
nodes in the core, with one or two extra nodes forming 
the tail of what looks like a tadpole. This effect is even 
more pronounced for n = 8 subgraphs. For instance, the 
three most frequent subgraphs with n — S and t — 11 all 
contained a 6-clique and two nodes connected to it either 
in chain or in parallel. None of them occurred even in a 
single rewired network. 
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FIG. 15: (color online) Counts cq resp. {c^q) for individual 
subgraphs in the Escherichia coli network, plotted against 
counts for the same subgraph in the yeast network. It can 
be seen that the two rewired networks are much more similar 
(display higher correlation) than the original networks. 



The situation is different for the E. coli network. 
There, the three most frequent graphs with 8 nodes and 
17 edges also have a tadpole structure, few twin nodes, 
and low bipartivity. But they are not very strong mo- 
tifs since they occur also frequently in the rewired net- 
works. The three strongest motifs with n = 8 and t — YI , 
in contrast, have many twin pairs and high bipartivity. 
They have slightly lower counts (by factors 2-4), but oc- 
cur much more rarely in the rewired networks. 



3. Effects of Rewiring on Differences between Networks 

Finally, Fig. 1151 shows counts for individual subgraphs 
in the E. coli network against counts for the same sub- 
graph in yeast. This is done for all four combinations of 
original and rewired networks. We see that the correla- 
tion is strongest when we compare rewired networks of 
E. coli to rewired networks of yeast. This is not surpris- 
ing. It means that a lack of correlations is mostly due 
to special features of one network which are not shared 
by the other. Rewiring eliminates most of these features. 
The other observation is that rewiring in general reduces 
further the counts for subgraphs which are already rare in 
the original networks. This is mainly due to the fact that 
such subgraphs are relatively densely connected, and ap- 
pear in the original networks only because of the strong 
clustering. This effect is more pronounced for yeast than 
for E. coli, because it is more sparse and has more densely 
connected clusters/complexes. 



13 



VI. DISCUSSION 

In this paper we have presented an algorithm for sani- 
pHng connected subgraphs uniformly from large net- 
works. This algorithm is a generalization of algorithms 
for sampling lattice animals, hence we refer to it as a 
"graph animal algorithm" and to the connected sub- 
graphs as "graph animals" . It allowed us to obtain high 
statistics estimates of subgraph censuses for two pro- 
tein interaction networks. Although the graph animal 
algorithm worked well in both cases, the analysis of the 
smaller network (E. coli) was much easier than that of 
the bigger (yeast). This was not so much because of the 
sheer size of the latter (the yeast network has about ten 
times more nodes and links than the E. coli network), 
but was mainly caused by the existence of stronger hubs. 
Indeed, the presence of hubs places a more stringent lim- 
itation on the method than the size of the network. 

One of the main results is that many subgraph fre- 
quency counts are hugely different from those in the most 
popular null model, which is the ensemble of networks 
with fixed degree sequence. Based on a comparison with 
this null model, most subgraphs with size > 6 in both net- 
works would be very strong motifs or anti-motifs. This 
clearly shows that alternative null models are needed 
which take clustering and other effects into account. 

While this was not very surprising (hints of it had been 
found in previous analyses), a more surprising result is 
the fact that the dominant motifs in the two protein in- 
teraction networks show very different features. Most of 
these seem to be related to the densely connected cores 
of a small number of complexes in the yeast network, 
which have no parallels in the E. coli network and which 
strongly affect the subgraph census. Further studies are 
needed to disentangle these effects from other - possibly 
biologically more interesting - effects. 

Finally, a feature with likely biological significance 
is the dominance of subgraphs with many twin nodes. 
These are nodes which share the same list of linked neigh- 
bors within the subgraph. They correspond to proteins 
which interact with the same set of other proteins. The 
most natural explanation for them is gene duplication. 
Connected to this is a preference for (approximately) bi- 
partite subgraphs. These two features are very clearly 
seen in the E. coli network, much less so in yeast. But 
it would be premature to conclude that gene duplication 
was evolutionary more important in E. coli than in yeast. 
It is more likely that its effect is just masked in the yeast 
network by other effects, most probably by the densely 
connected complexes and other clustering effects which 
do not show up to the same extent in E. coli. 

Up to now, wc know very little about the biological 
significance of our findings. One main avenue of further 
work could be to relate our results on subgraph abun- 
dances in more detail to properties of the network that 
are associated with biological function. Another impor- 
tant problem is the comparison between network recon- 
structions which supposedly describe the same or similar 
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Cq / <c^ 'q> for Krogan et al. 

FIG. 16: (color online) Count ratios cg/{cq^ for individ- 
ual subgraphs in the yeast networks of Refs. [63l. |6^ . plotted 
against counts for the same subgraph in the network of [47|. 
If all three networks were identical, all points should lie on 
the diagonal (indicated by the straight dashed line), whereas 
in fact systematic deviations are observed. 



objects. There exist, e.g., a large number of published 
protein-protein interaction networks for yeast. Some 
were obtained by means of different experimental tech- 
niques, either with conventional or with high through- 
put methods, while others were obtained by comprehen- 
sive literature compilations. In a preliminary step, we 
compared three such networks: The network obtained 
by Krogan et al. [13] that was studied above, a some- 
what older network downloaded from [g^l and attributed 
to Bu et al. [6^, and the 'high confidence' (HC) network 
of Batada et al. The latter is the most recent. It 

was obtained by extracting the most reliable interactions 
from a vast data base which includes the data of both Bu 
et al. and Krogan et al.. In Fig.[T6]we plot the ratios be- 
tween the actual counts and the average counts in rewired 
networks for Bu et al. and for the HC data set against the 
analogous ratios for the Krogan et al. networks. If the 
three data sets indeed describe the same yeast network - 
as they purport to do, within experimental uncertainties 
- the points should all fall onto the diagonal. Instead, we 
see systematic deviations. Surprisingly, these deviations 
are much stronger between the Krogan et al. and the HC 
networks than between the Krogan et al. and the Bu et 
al. networks. Clarifying these and other systematic ir- 
regularities should give valuable insight into the strengths 
and weaknesses of the methods used in constructing the 
networks as well as their biological reliability, and should 
lead to improved methods for network reconstruction. 

In the present paper we have only dealt with undi- 
rected networks. The basic sampling algorithm works 
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equally well for directed networks. The main obstacle 
in applying our methods to the latter is the huge num- 
ber of directed subgraphs, even for relatively small sizes. 
Nevertheless, we will present an analysis of directed net- 



works in forthcoming work, as well as applications to 
other undirected networks. 

Acknowledgements: We thank Gabriel Musso for valu- 
able information on the yeast network. 
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